Introduction

In many countries, the price of a home is determined almost entirely by its size. Within the same community, for example, the sale price is simply the floor area multiplied by the price per square meter: no matter how many bedrooms they have, two houses of the same size sell for the same price. In that case there is a linear relationship between housing price and house size. In a densely populated city such as Hong Kong, housing prices are incredibly high and the only option for the majority is an apartment.

Fig 1. Hong Kong Night View

In the United States, however, housing price is not determined by a single variable. Many explanatory variables contribute to the price, since people can live in a house with its own garage and pool.

Fig 2. American Housing

Now, let's take a look at one Kaggle competition: "House Prices - Advanced Regression Techniques".

For this project, we will work with the files "train.csv" and "test.csv", found in /data/house_prices. The files are from Kaggle: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data.

library(recipes)
library(workflows)
library(dplyr)
library(parsnip)
library(tidymodels)
library(ISLR)
library(ISLR2)
library(tidyverse)
library(glmnet)
tidymodels_prefer()
library(ggplot2)
library(readr)
library(gplots)
library(repr)
library(caret)
library(rpart.plot)
library(vip)
library(discrim)
library(poissonreg)
library(corrr)
library(klaR) # for naive bayes
library(janitor)
library(rsample)

Data Cleaning and Missing Data

After reading the CSV files, we use the janitor package to clean the variable names. Then we check for missing data in the dataset and familiarize ourselves with its variables.

train_data <- read_csv("~/Downloads/house_prices/train.csv") %>%
  clean_names()
test_data <- read_csv("~/Downloads/house_prices/test.csv") %>%
  clean_names()
missing_row <- train_data[!complete.cases(train_data),]
head(missing_row)
## # A tibble: 6 × 81
##      id ms_sub_class ms_zoning lot_frontage lot_area street alley lot_shape
##   <dbl>        <dbl> <chr>            <dbl>    <dbl> <chr>  <chr> <chr>    
## 1     1           60 RL                  65     8450 Pave   <NA>  Reg      
## 2     2           20 RL                  80     9600 Pave   <NA>  Reg      
## 3     3           60 RL                  68    11250 Pave   <NA>  IR1      
## 4     4           70 RL                  60     9550 Pave   <NA>  IR1      
## 5     5           60 RL                  84    14260 Pave   <NA>  IR1      
## 6     6           50 RL                  85    14115 Pave   <NA>  IR1      
## # … with 73 more variables: land_contour <chr>, utilities <chr>,
## #   lot_config <chr>, land_slope <chr>, neighborhood <chr>, condition1 <chr>,
## #   condition2 <chr>, bldg_type <chr>, house_style <chr>, overall_qual <dbl>,
## #   overall_cond <dbl>, year_built <dbl>, year_remod_add <dbl>,
## #   roof_style <chr>, roof_matl <chr>, exterior1st <chr>, exterior2nd <chr>,
## #   mas_vnr_type <chr>, mas_vnr_area <dbl>, exter_qual <chr>, exter_cond <chr>,
## #   foundation <chr>, bsmt_qual <chr>, bsmt_cond <chr>, bsmt_exposure <chr>, …
variables <- names(train_data)
variables
##  [1] "id"              "ms_sub_class"    "ms_zoning"       "lot_frontage"   
##  [5] "lot_area"        "street"          "alley"           "lot_shape"      
##  [9] "land_contour"    "utilities"       "lot_config"      "land_slope"     
## [13] "neighborhood"    "condition1"      "condition2"      "bldg_type"      
## [17] "house_style"     "overall_qual"    "overall_cond"    "year_built"     
## [21] "year_remod_add"  "roof_style"      "roof_matl"       "exterior1st"    
## [25] "exterior2nd"     "mas_vnr_type"    "mas_vnr_area"    "exter_qual"     
## [29] "exter_cond"      "foundation"      "bsmt_qual"       "bsmt_cond"      
## [33] "bsmt_exposure"   "bsmt_fin_type1"  "bsmt_fin_sf1"    "bsmt_fin_type2" 
## [37] "bsmt_fin_sf2"    "bsmt_unf_sf"     "total_bsmt_sf"   "heating"        
## [41] "heating_qc"      "central_air"     "electrical"      "x1st_flr_sf"    
## [45] "x2nd_flr_sf"     "low_qual_fin_sf" "gr_liv_area"     "bsmt_full_bath" 
## [49] "bsmt_half_bath"  "full_bath"       "half_bath"       "bedroom_abv_gr" 
## [53] "kitchen_abv_gr"  "kitchen_qual"    "tot_rms_abv_grd" "functional"     
## [57] "fireplaces"      "fireplace_qu"    "garage_type"     "garage_yr_blt"  
## [61] "garage_finish"   "garage_cars"     "garage_area"     "garage_qual"    
## [65] "garage_cond"     "paved_drive"     "wood_deck_sf"    "open_porch_sf"  
## [69] "enclosed_porch"  "x3ssn_porch"     "screen_porch"    "pool_area"      
## [73] "pool_qc"         "fence"           "misc_feature"    "misc_val"       
## [77] "mo_sold"         "yr_sold"         "sale_type"       "sale_condition" 
## [81] "sale_price"
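Beyond previewing the incomplete rows, it also helps to count missing values per variable. The sketch below shows the idiom on a toy data frame standing in for train_data; the last two lines apply to train_data unchanged.

```r
# Count missing values per column and keep only the variables that have any.
# Toy data frame standing in for train_data:
toy <- data.frame(lot_frontage = c(65, NA, 68),
                  alley        = c(NA, NA, "Pave"),
                  lot_area     = c(8450, 9600, 11250))
na_counts <- colSums(is.na(toy))
sort(na_counts[na_counts > 0], decreasing = TRUE)
# alley has 2 missing values, lot_frontage has 1
```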

Simple Linear Regression

Data selection

There are 81 different variables in the dataset. Since simple linear regression uses only numerical data, all non-numerical variables are out of consideration here. After selecting a set of numerical variables from the original dataset, we are left with 13 predictors and 1 outcome variable.

select_train <- train_data %>%
  select(overall_qual,overall_cond,year_built, gr_liv_area, bedroom_abv_gr, kitchen_abv_gr, tot_rms_abv_grd, 
         fireplaces, garage_area, open_porch_sf, pool_area, mo_sold, yr_sold, sale_price)
head(select_train)
## # A tibble: 6 × 14
##   overall_qual overall_cond year_built gr_liv_area bedroom_abv_gr kitchen_abv_gr
##          <dbl>        <dbl>      <dbl>       <dbl>          <dbl>          <dbl>
## 1            7            5       2003        1710              3              1
## 2            6            8       1976        1262              3              1
## 3            7            5       2001        1786              3              1
## 4            7            5       1915        1717              3              1
## 5            8            5       2000        2198              4              1
## 6            5            5       1993        1362              1              1
## # … with 8 more variables: tot_rms_abv_grd <dbl>, fireplaces <dbl>,
## #   garage_area <dbl>, open_porch_sf <dbl>, pool_area <dbl>, mo_sold <dbl>,
## #   yr_sold <dbl>, sale_price <dbl>
summary(select_train$sale_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
hist(select_train$sale_price, xlab = "sale price", ylab = "counts", xlim = c(30000, 760000), main = "Histogram of sale price")

select_train <- select_train %>%
  mutate(sale_price = log(sale_price))
summary(select_train$sale_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.46   11.78   12.00   12.02   12.27   13.53
hist(select_train$sale_price, xlab = "sale price", ylab = "counts", main = "Histogram of sale price")

typeof(select_train)
## [1] "list"
names(select_train)
##  [1] "overall_qual"    "overall_cond"    "year_built"      "gr_liv_area"    
##  [5] "bedroom_abv_gr"  "kitchen_abv_gr"  "tot_rms_abv_grd" "fireplaces"     
##  [9] "garage_area"     "open_porch_sf"   "pool_area"       "mo_sold"        
## [13] "yr_sold"         "sale_price"

Because we know nothing yet about the correlations among the variables, we can pick one variable as the predictor and check the fit of the model using its R-squared value. For example, suppose we are interested in the relationship between the sale price and the total number of rooms a house has.

reg_rms <- lm(sale_price ~ tot_rms_abv_grd, data = select_train)
summary(reg_rms)
## 
## Call:
## lm(formula = sale_price ~ tot_rms_abv_grd, data = select_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4133 -0.1940  0.0071  0.2029  1.0556 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     11.16802    0.03654  305.62   <2e-16 ***
## tot_rms_abv_grd  0.13134    0.00544   24.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3377 on 1458 degrees of freedom
## Multiple R-squared:  0.2856, Adjusted R-squared:  0.2851 
## F-statistic: 582.9 on 1 and 1458 DF,  p-value: < 2.2e-16
ggplot(select_train, aes(x = tot_rms_abv_grd, y = sale_price)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")

It has an adjusted R-squared value of 0.2851, which means only 28.51% of the variance in (log) sale price is explained by the total number of rooms. Clearly, we are not going to draw a separate regression between sale price and each explanatory variable. Instead, we will draw a correlation plot of all our selected variables and choose the predictor with the highest correlation coefficient with the outcome variable.

cor_select <- select_train %>%
  correlate()
rplot(cor_select)

cor_select %>%
  stretch() %>%
  ggplot(aes(x, y, fill = r)) +
  geom_tile() +
  geom_text(aes(label = as.character(fashion(r))))+
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  NULL

Simple Linear Regression

Based on the correlation plot, we can see that the overall quality of a house has the highest correlation with sale price. Thus, the best simple linear regression model will have overall quality as the explanatory variable and sale price as the dependent variable.

reg_qual <- lm(sale_price ~ overall_qual, data = select_train)
summary(reg_qual)
## 
## Call:
## lm(formula = sale_price ~ overall_qual, data = select_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.06831 -0.12974  0.01309  0.13332  0.92438 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.58444    0.02727  388.18   <2e-16 ***
## overall_qual  0.23603    0.00436   54.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2303 on 1458 degrees of freedom
## Multiple R-squared:  0.6678, Adjusted R-squared:  0.6676 
## F-statistic:  2931 on 1 and 1458 DF,  p-value: < 2.2e-16

The adjusted R-squared value is 0.6676, meaning the model explains 66.76% of the variance in log sale price. The predictor's coefficient of 0.236 shows that overall quality is positively correlated with sale price. Given the limitations of simple linear regression, this is the best model we can build so far.

ggplot(select_train, aes(x = overall_qual, y = sale_price)) + 
  geom_point() +
  stat_smooth(method = "lm", col = "red")
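Keep in mind that sale_price here is on the log scale, so a fitted value must be exponentiated to get back to dollars. A minimal sketch of the back-transform idiom on toy data (the same exp(predict(...)) call applies to reg_qual):

```r
# Fit on the log scale, predict, then exponentiate back to the original scale
d <- data.frame(x = 1:10, y = exp(0.5 * (1:10)))
fit <- lm(log(y) ~ x, data = d)                          # log(y) = 0.5 * x exactly
pred_log  <- predict(fit, newdata = data.frame(x = 6))   # prediction on the log scale
pred_orig <- exp(pred_log)                               # back on the original scale
```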

Multiple Linear Regression

In this section, we introduce another model: multiple linear regression. As mentioned in the previous section, adding more predictor variables can help increase the accuracy of our prediction model. First, we will try using all the predictors.

reg <- lm(sale_price~.-sale_price, data = select_train)
summary(reg)
## 
## Call:
## lm(formula = sale_price ~ . - sale_price, data = select_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86658 -0.08263  0.00559  0.09285  0.52817 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.883e+00  6.572e+00   1.504 0.132814    
## overall_qual     8.994e-02  5.149e-03  17.467  < 2e-16 ***
## overall_cond     5.579e-02  4.280e-03  13.034  < 2e-16 ***
## year_built       3.923e-03  2.024e-04  19.383  < 2e-16 ***
## gr_liv_area      2.348e-04  1.787e-05  13.142  < 2e-16 ***
## bedroom_abv_gr  -8.656e-03  7.573e-03  -1.143 0.253187    
## kitchen_abv_gr  -7.678e-02  2.184e-02  -3.516 0.000452 ***
## tot_rms_abv_grd  1.441e-02  5.681e-03   2.536 0.011320 *  
## fireplaces       7.445e-02  7.746e-03   9.611  < 2e-16 ***
## garage_area      2.999e-04  2.601e-05  11.529  < 2e-16 ***
## open_porch_sf   -7.573e-06  6.968e-05  -0.109 0.913472    
## pool_area       -3.004e-04  1.091e-04  -2.754 0.005958 ** 
## mo_sold          5.523e-04  1.611e-03   0.343 0.731718    
## yr_sold         -3.480e-03  3.271e-03  -1.064 0.287515    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1632 on 1446 degrees of freedom
## Multiple R-squared:  0.8345, Adjusted R-squared:  0.833 
## F-statistic: 560.9 on 13 and 1446 DF,  p-value: < 2.2e-16

As we can see, there is a significant improvement in model accuracy, but some predictors are not significant in the prediction model. The model assumed that every predictor can be used as an explanatory variable. In statistics, however, a predictor is conventionally considered statistically significant when its p-value is below 0.05. So we can eliminate the statistically insignificant predictors without harming the accuracy of the model.
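The elimination can also be done programmatically by pulling the p-values out of summary(). A sketch on the built-in mtcars data (the same idiom applies to the full model reg above):

```r
# Keep only predictors whose coefficient p-value is below 0.05
fit <- lm(mpg ~ wt + cyl, data = mtcars)
p_vals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # drop the intercept row
keep <- names(p_vals[p_vals < 0.05])
keep
```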

Reducing Predictor Size

reg <- lm(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area + kitchen_abv_gr + fireplaces + garage_area + pool_area, data = select_train)
summary(reg)
## 
## Call:
## lm(formula = sale_price ~ overall_qual + overall_cond + year_built + 
##     gr_liv_area + kitchen_abv_gr + fireplaces + garage_area + 
##     pool_area, data = select_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.90258 -0.08228  0.00475  0.09166  0.52893 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.9584236  0.3981991   7.430 1.85e-13 ***
## overall_qual    0.0910835  0.0050405  18.070  < 2e-16 ***
## overall_cond    0.0555227  0.0042735  12.992  < 2e-16 ***
## year_built      0.0038962  0.0002021  19.282  < 2e-16 ***
## gr_liv_area     0.0002624  0.0000118  22.242  < 2e-16 ***
## kitchen_abv_gr -0.0621671  0.0209142  -2.972  0.00300 ** 
## fireplaces      0.0746647  0.0076916   9.707  < 2e-16 ***
## garage_area     0.0003016  0.0000258  11.690  < 2e-16 ***
## pool_area      -0.0003229  0.0001083  -2.981  0.00292 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1634 on 1451 degrees of freedom
## Multiple R-squared:  0.8336, Adjusted R-squared:  0.8327 
## F-statistic: 908.8 on 8 and 1451 DF,  p-value: < 2.2e-16

Now we have a smaller group of predictors but the same R-squared value. Looking more closely at the data, we find that almost no house in the dataset has a pool: more than 99% of the houses have a pool area of 0. Moreover, with many predictors and a high R-squared value, a model may run into a problem called overfitting. Next, we will try further reducing the number of predictors and check the difference in model accuracy.
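The claim about pool_area can be checked directly by computing the share of zero entries. The helper prop_zero below is a hypothetical name introduced for illustration; applying it to select_train$pool_area verifies the claim.

```r
# Share of observations equal to zero; values near 1 flag a
# near-constant predictor that carries almost no information
prop_zero <- function(x) mean(x == 0)
prop_zero(c(rep(0, 99), 738))   # 0.99
```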

reg1 <- lm(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces + garage_area, data = select_train)
summary(reg1)
## 
## Call:
## lm(formula = sale_price ~ overall_qual + overall_cond + year_built + 
##     gr_liv_area + fireplaces + garage_area, data = select_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.01708 -0.08287  0.00611  0.09265  0.53421 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.782e+00  3.956e-01   7.033 3.09e-12 ***
## overall_qual 9.437e-02  4.976e-03  18.964  < 2e-16 ***
## overall_cond 5.703e-02  4.261e-03  13.384  < 2e-16 ***
## year_built   3.948e-03  2.024e-04  19.509  < 2e-16 ***
## gr_liv_area  2.486e-04  1.132e-05  21.956  < 2e-16 ***
## fireplaces   7.752e-02  7.640e-03  10.147  < 2e-16 ***
## garage_area  3.014e-04  2.593e-05  11.624  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1642 on 1453 degrees of freedom
## Multiple R-squared:  0.8317, Adjusted R-squared:  0.831 
## F-statistic:  1197 on 6 and 1453 DF,  p-value: < 2.2e-16
reg2 <- lm(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces, data = select_train)
summary(reg2)
## 
## Call:
## lm(formula = sale_price ~ overall_qual + overall_cond + year_built + 
##     gr_liv_area + fireplaces, data = select_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93571 -0.08658  0.00574  0.10183  0.55764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.580e+00  3.990e-01   3.960 7.84e-05 ***
## overall_qual 1.052e-01  5.110e-03  20.582  < 2e-16 ***
## overall_cond 5.710e-02  4.453e-03  12.822  < 2e-16 ***
## year_built   4.570e-03  2.039e-04  22.411  < 2e-16 ***
## gr_liv_area  2.816e-04  1.145e-05  24.583  < 2e-16 ***
## fireplaces   7.852e-02  7.984e-03   9.834  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1716 on 1454 degrees of freedom
## Multiple R-squared:  0.816,  Adjusted R-squared:  0.8154 
## F-statistic:  1290 on 5 and 1454 DF,  p-value: < 2.2e-16
reg_multi <- reg2   # keep the final multiple regression model under a clearer name

Without the predictors "kitchen_abv_gr" and "pool_area", the model retains the same level of accuracy. "kitchen_abv_gr" is likely correlated with another predictor, and "pool_area" is close to a zero vector, so these two columns add little information to the predictor matrix. The new model therefore reaches the same accuracy with fewer predictors.
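The claim that dropping predictors does not hurt the fit can be formalized with a nested-model F-test; the same anova(reduced, full) call applies to our reduced model against the full one. A sketch on mtcars:

```r
# Compare a reduced model against the full model it is nested in;
# a large p-value would mean the dropped term adds little explanatory power
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
anova(reduced, full)
```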

Training and Testing data

Now that we have the best model we can achieve with multiple linear regression, we can, out of curiosity, test it with a machine learning workflow. First, we split the data into training and testing groups. Since the test file Kaggle provides does not contain sale prices, we split the train file instead. Then we set up a recipe with our multiple regression formula and fit the model to the training data. Finally, we predict the sale price for the test split and compare it with the actual price.

set.seed(123)

select_split <- initial_split(select_train, prop = 0.80,
                                strata = sale_price)
train <- training(select_split)
test <- testing(select_split)
house_recipe <- recipe(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces + garage_area, data = train)
lm_model <- linear_reg() %>% 
  set_engine("lm")
lm_wflow <- workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(house_recipe)
lm_fit <- fit(lm_wflow, train)
lm_fit %>%
  extract_fit_parsnip() %>% 
  tidy()
## # A tibble: 7 × 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  2.52     0.434          5.80 8.48e- 9
## 2 overall_qual 0.0953   0.00545       17.5  5.99e-61
## 3 overall_cond 0.0590   0.00477       12.4  4.78e-33
## 4 year_built   0.00407  0.000222      18.3  3.37e-66
## 5 gr_liv_area  0.000252 0.0000130     19.4  9.06e-73
## 6 fireplaces   0.0742   0.00850        8.73 9.10e-18
## 7 garage_area  0.000306 0.0000285     10.8  8.69e-26

By combining the model's predicted values with the actual values, we can generate a plot of predicted versus actual sale prices.

test_res <- predict(lm_fit, new_data = test %>% select(-sale_price))
test_res %>% 
  head()
## # A tibble: 6 × 1
##   .pred
##   <dbl>
## 1  12.4
## 2  11.7
## 3  11.7
## 4  11.3
## 5  11.9
## 6  11.7
test_res <- bind_cols(test_res, test %>% select(sale_price))
test_res %>% 
  head()
## # A tibble: 6 × 2
##   .pred sale_price
##   <dbl>      <dbl>
## 1  12.4       12.2
## 2  11.7       11.8
## 3  11.7       11.9
## 4  11.3       11.1
## 5  11.9       11.9
## 6  11.7       11.6
test_res %>% 
  ggplot(aes(x = .pred, y = sale_price)) +
  geom_point(alpha = 0.2) +
  geom_abline(lty = 2) + 
  theme_bw() +
  coord_obs_pred()

Since the dots spread along the diagonal line, the model did a pretty good job of predicting the sale price. We can also check the RMSE and R-squared values to evaluate model accuracy.

predict_metrics <- metric_set(rmse, rsq, mae)
predict_metrics(test_res, truth = sale_price, 
                estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       0.168
## 2 rsq     standard       0.827
## 3 mae     standard       0.118
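Since the model predicts log(sale_price), the RMSE of 0.168 is in log-dollar units. Exponentiating converts it to a rough multiplicative error factor, i.e. a typical prediction is off by roughly 18% in either direction:

```r
# Convert an RMSE on the log scale into a multiplicative error factor
exp(0.168)   # about 1.18, i.e. roughly +/- 18% on the dollar scale
```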

Summary

From our prediction model and explanatory variables, we found that hardly any house in the data set has a pool.

summary(train$pool_area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    2.54    0.00  738.00

The number of fireplaces plays an important role in the prediction model, so we can guess that these data were collected from a cold place. In the next section, we will add data about heating and other house conditions to another model.

Classification & Regression Trees Model

In this section, we will explore more explanatory variables. Since we have not yet considered the non-numerical data, we will use additional methods, such as classification and regression tree models, to predict housing prices. As noted in the previous section, these data may come from a cold place, so we want to evaluate the heating and air-conditioning conditions of each house: in cold places people spend a lot of time indoors, and these factors are certainly something buyers care about.

# Since each of these factors rates the condition of the exterior material or
# the heating system, we can convert them to numerical values. Listing the
# levels in quality order makes the codes run from 1 (Po) to 5 (Ex).
train_data$exter_cond <- as.numeric(factor(train_data$exter_cond,
                                  levels = c("Po", "Fa", "TA", "Gd", "Ex"),
                                  ordered = TRUE))
train_data$heating_qc <- as.numeric(factor(train_data$heating_qc,
                                  levels = c("Po", "Fa", "TA", "Gd", "Ex"),
                                  ordered = TRUE))
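Note that as.numeric() applied to a factor returns the underlying level codes (each value's position in levels), not any label values, so the ordering of levels determines the numeric coding. A toy check with the levels listed in quality order:

```r
# With levels listed worst-to-best, the codes run 1 (Po) to 5 (Ex)
qc <- c("Ex", "TA", "Po")
as.numeric(factor(qc, levels = c("Po", "Fa", "TA", "Gd", "Ex")))
# returns 5 3 1
```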

Now, we can create a new selected dataset that includes our new predictors.

new_select_data <- train_data %>%
  select(overall_qual, overall_cond, year_built, gr_liv_area, fireplaces, garage_area, central_air, heating_qc, exter_cond, sale_price)
head(new_select_data)
## # A tibble: 6 × 10
##   overall_qual overall_cond year_built gr_liv_area fireplaces garage_area
##          <dbl>        <dbl>      <dbl>       <dbl>      <dbl>       <dbl>
## 1            7            5       2003        1710          0         548
## 2            6            8       1976        1262          1         460
## 3            7            5       2001        1786          1         608
## 4            7            5       1915        1717          1         642
## 5            8            5       2000        2198          1         836
## 6            5            5       1993        1362          0         480
## # … with 4 more variables: central_air <chr>, heating_qc <dbl>,
## #   exter_cond <dbl>, sale_price <dbl>

Similar to the first regression part, we split the data; we also create 5-fold cross-validation folds.

set.seed(131)
new_split <- initial_split(new_select_data, prop =0.8, strata = "sale_price")

new_train <- training(new_split)
new_test <- testing(new_split)

new_fold <- vfold_cv(new_train, v = 5, strata = "sale_price")

We set up a new recipe for the tree models, and then fit a tree directly with rpart.

price_recipe <- recipe(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces + garage_area + central_air + heating_qc + exter_cond, data = new_train)  %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors())
class.tree <- rpart(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces + garage_area + central_air + heating_qc + exter_cond,
                    data = new_train, control = rpart.control(cp = 0.01))

plotcp(class.tree)

printcp(class.tree)
## 
## Regression tree:
## rpart(formula = sale_price ~ overall_qual + overall_cond + year_built + 
##     gr_liv_area + fireplaces + garage_area + central_air + heating_qc + 
##     exter_cond, data = new_train, control = rpart.control(cp = 0.01))
## 
## Variables actually used in tree construction:
## [1] gr_liv_area  overall_qual year_built  
## 
## Root node error: 7.574e+12/1166 = 6495687471
## 
## n= 1166 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.447474      0   1.00000 1.00102 0.081591
## 2 0.123676      1   0.55253 0.55574 0.042324
## 3 0.063138      2   0.42885 0.43228 0.040314
## 4 0.037239      3   0.36571 0.36870 0.028696
## 5 0.022119      4   0.32847 0.33228 0.028287
## 6 0.020687      5   0.30635 0.32322 0.028286
## 7 0.017157      6   0.28567 0.30965 0.027618
## 8 0.013418      7   0.26851 0.29745 0.027263
## 9 0.010000      8   0.25509 0.27643 0.024881
rpart.plot(class.tree)
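The cp table above can also drive automated pruning: pick the cp row with the lowest cross-validated error (xerror) and prune to it. A sketch on a small built-in data set (the same idiom applies to class.tree):

```r
library(rpart)

# Fit a small regression tree, then prune at the cp value whose
# cross-validated error (xerror) is lowest
fit <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 0.01))
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```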

The tree above was fit directly with rpart (note that, because the outcome is numeric, it is already a regression tree). We can also fit a regression tree through tidymodels, reusing the predictors from the previous section.

tree_spec <- decision_tree() %>%
  set_engine("rpart")

reg_tree_spec <- tree_spec %>%
  set_mode("regression")

reg_tree_fit <- fit(reg_tree_spec, 
                    sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces + garage_area + central_air + heating_qc + exter_cond, new_train)

augment(reg_tree_fit, new_data = new_test) %>%
  rmse(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      38992.
reg_tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

reg_tree_wf <- workflow() %>%
  add_model(reg_tree_spec %>% set_args(cost_complexity = tune())) %>%
  add_formula(sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces)

set.seed(3435)
reg_fold <- vfold_cv(new_train)

param_grid <- grid_regular(cost_complexity(range = c(-4, -1)), levels = 5)

tune_res <- tune_grid(
  reg_tree_wf, 
  resamples = reg_fold, 
  grid = param_grid
)
autoplot(tune_res)

We select the best-performing model according to rmse and fit the final model on the whole training data set.

best_complexity <- select_best(tune_res, metric = "rmse")

reg_tree_final <- finalize_workflow(reg_tree_wf, best_complexity)

reg_tree_final_fit <- fit(reg_tree_final, data = new_train)

Now we can see a more complex decision tree than the previous one.

reg_tree_final_fit %>%
  extract_fit_engine() %>%
  rpart.plot()

Bagging and Random Forest Model

Now, in this section, we will explore bagging and random forest models with our new predictors and test their accuracy. Bagging is a random forest that considers every predictor at each split, which is why mtry is set to .cols() below.

bagging_spec <- rand_forest(mtry = .cols()) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("regression")
bagging_fit <- fit(bagging_spec, sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces + garage_area + heating_qc + exter_cond, 
                   data = new_train)
augment(bagging_fit, new_data = new_test) %>%
  rmse(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      26779.

Let's compare our predictions to the actual test values.

augment(bagging_fit, new_data = new_test) %>%
  ggplot(aes(sale_price, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)

We can see that our model did a pretty good job, but there are also some outliers at the high end.

Then let’s check the importance of the variables and see if it fits our previous prediction.

vip(bagging_fit)

As we can see, the above-ground living area plays a big role in the prediction. However, the extra factors we added, such as fireplaces and heating condition, are also important to the model.

Then, we will fit a random forest model. Unlike bagging, a random forest samples a subset of predictors at each split; here we use mtry = 6 of the 8 predictors.

rf_spec <- rand_forest(mtry = 6) %>%
  set_engine("randomForest", importance = TRUE) %>%
  set_mode("regression")

Next, we fit the model:

rf_fit <- fit(rf_spec, sale_price~ overall_qual + overall_cond + year_built + gr_liv_area  + fireplaces + garage_area + heating_qc + exter_cond, data = new_train)
augment(rf_fit, new_data = new_train) %>%
  rmse(truth = sale_price, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      15721.
augment(rf_fit, new_data = new_test) %>%
  ggplot(aes(sale_price, .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5)

This looks very similar to our bagging model. Note, though, that the RMSE above was computed on the training data, so it is not directly comparable to the bagging model's test-set RMSE. Next, we will look at the importance of each variable in the random forest model.

vip(rf_fit)

Conclusion

From all the models we fit and the data we trained on, we can say that buyers care most about the above-ground living area and the garage area of a house. But compared with the model we built in lab7, this data set also has important predictors such as fireplaces and heating condition. We may therefore conjecture that the housing data from this Kaggle competition come from a cold part of the U.S. or Europe, where the population is sparse and there is more land per person. In many Asian countries there is almost no tradition of fireplaces in homes, and cities there favor a different housing type altogether.